This document explores Tesla’s stock price from its initial public offering on June 29, 2010, to December 31, 2019. Tesla’s share price has risen dramatically over this period, and machine learning algorithms may reveal seasonality or autocorrelation in the series. Predictive analysis, and machine learning in particular, can help shield investors from some of the uncertainty of the market. This investigation applies machine learning techniques to the historical prices to evaluate the direction of market movement and to forecast the stock’s future value. The research questions are as follows:

  1. How accurate is the ARIMA model with the addition of technical indicators from historical data?
  2. Which attribute(s) best predicts the direction of the stock market movement?

The forecasts will be compared to present-day stock values to determine the accuracy of the algorithms, alongside a review of the factors that have affected the stock price to date (e.g. supply and demand, the economy, stock splits).

Importing Libraries and Data

library(quantmod) 
library(dplyr)
library(lubridate)
library(summarytools)
library(dvmisc)
library(corrplot)
library(tseries)
library(forecast)
library(ggplot2)
library(plotly)
library(caret)
library(formattable)
library(dygraphs)
library(hrbrthemes)
library(TSstudio)
library(FSelector)

stock_list <- "TSLA"
start_date <- as.Date("2010-06-29")
end_date <- as.Date("2020-01-01")
tesla <- NULL

for (symbol in stock_list){
  getSymbols(symbol, verbose = FALSE, src = "yahoo",
             from = start_date, to = end_date)
  temp_df <- as.data.frame(get(symbol))
  temp_df$Date <- row.names(temp_df)
  row.names(temp_df) <- NULL
  colnames(temp_df) <- c("Open", "High", "Low", "Close",
                         "Volume", "Adjusted", "Date")
  temp_df <- temp_df[c("Date", "Open", "High",
                       "Low", "Close", "Volume", "Adjusted")]
  tesla <- rbind(tesla, temp_df)
}

To account for the COVID-19 pandemic, the Tesla dataset was limited to the years 2010-2019. Given the economic uncertainty of 2020, I believe that year's data would contain substantial white noise.

Data Preparation and Exploration

The data analysis stage first consists of cleaning and inspecting the data for inconsistencies. Following these steps, the data may undergo transformations and modelling as required. As part of the data preparation stage, the structure and types of the data will be reviewed, missing values checked, and the Date column converted to a proper date type:

head(tesla)
##         Date  Open  High   Low Close   Volume Adjusted
## 1 2010-06-29 3.800 5.000 3.508 4.778 93831500    4.778
## 2 2010-06-30 5.158 6.084 4.660 4.766 85935500    4.766
## 3 2010-07-01 5.000 5.184 4.054 4.392 41094000    4.392
## 4 2010-07-02 4.600 4.620 3.742 3.840 25699000    3.840
## 5 2010-07-06 4.000 4.000 3.166 3.222 34334500    3.222
## 6 2010-07-07 3.280 3.326 2.996 3.160 34608500    3.160
str(tesla)
## 'data.frame':    2394 obs. of  7 variables:
##  $ Date    : chr  "2010-06-29" "2010-06-30" "2010-07-01" "2010-07-02" ...
##  $ Open    : num  3.8 5.16 5 4.6 4 ...
##  $ High    : num  5 6.08 5.18 4.62 4 ...
##  $ Low     : num  3.51 4.66 4.05 3.74 3.17 ...
##  $ Close   : num  4.78 4.77 4.39 3.84 3.22 ...
##  $ Volume  : num  93831500 85935500 41094000 25699000 34334500 ...
##  $ Adjusted: num  4.78 4.77 4.39 3.84 3.22 ...
tesla$Date <- as.Date(tesla$Date) 
class(tesla$Date)
## [1] "Date"

The ‘Date’ attribute was converted to a Date type.

sum(is.na(tesla))
## [1] 0

There are no missing values.

Next, a correlation plot will determine whether the attributes are correlated and to what degree.
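The correlation matrix below can be computed with cor and visualized with corrplot (a sketch; the exact plotting options used originally are not shown):

```r
# Correlation matrix of the numeric attributes of the tesla data frame
price_cor <- cor(tesla[, c("Open", "High", "Low", "Close", "Volume", "Adjusted")])

# Visualize the pairwise correlations (corrplot was loaded above)
corrplot(price_cor, method = "number", type = "upper")
```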

##               Open      High       Low     Close    Volume  Adjusted
## Open     1.0000000 0.9995588 0.9995524 0.9990113 0.4618287 0.9990113
## High     0.9995588 1.0000000 0.9994779 0.9996208 0.4710375 0.9996208
## Low      0.9995524 0.9994779 1.0000000 0.9995621 0.4526298 0.9995621
## Close    0.9990113 0.9996208 0.9995621 1.0000000 0.4624999 1.0000000
## Volume   0.4618287 0.4710375 0.4526298 0.4624999 1.0000000 0.4624999
## Adjusted 0.9990113 0.9996208 0.9995621 1.0000000 0.4624999 1.0000000

To understand the attributes further, the measures of central tendency will be reviewed.


Descriptive Statistics of Tesla Stock

              Min     Q1  Median   Mean     Q3     Max  Std.Dev
Open         3.23   6.85   42.42  36.62  52.80   87.00    22.89
High         3.33   6.96   43.19  37.26  53.55   87.06    23.24
Low          3.00   6.70   41.61  35.96  52.02   85.27    22.52
Close        3.16   6.87   42.32  36.63  52.78   86.19    22.90
Volume (M)   0.59   9.37   22.65  27.17  36.26  185.82    23.60
Adjusted     3.16   6.87   42.32  36.63  52.78   86.19    22.90


These values show the range of the Tesla stock over time. Notably, the gap between the minimum and maximum stock values is quite large, reflecting trends over several years. Since this is a time-series dataset starting at the initial public offering, the extreme values are unlikely to be outliers; the stock's value has simply changed drastically over the years.

For the purposes of the forecasting investigation, the closing stock price (i.e. the value of the stock at the end of the trading day) will be used. The table below displays the trends of the closing stock prices from 2010-2019.


Tesla Closing Stock Price Statistics

Year    Min    Max  Average  % Change per Fiscal Year
2010   3.16   7.09     4.67    11.47
2011   4.37   6.99     5.36     7.29
2012   4.56   7.60     6.23    20.62
2013   6.58  38.67    20.88   325.42
2014  27.87  57.21    44.67    48.17
2015  37.00  56.45    46.01     9.44
2016  28.73  53.08    41.95    -4.35
2017  43.40  77.00    62.86    43.49
2018  50.11  75.91    63.46     3.83
2019  35.79  86.19    54.71    34.89
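The yearly figures above can be reproduced with a dplyr summary along these lines (a sketch; the original aggregation code is not shown, and the percent change is assumed to be computed from the first and last close of each year):

```r
tesla %>%
  mutate(Year = year(Date)) %>%   # year() comes from lubridate
  group_by(Year) %>%
  summarise(Min = min(Close),
            Max = max(Close),
            Average = mean(Close),
            PctChange = (last(Close) - first(Close)) / first(Close) * 100)
```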


From this point, the data will be split into training and test sets. However, prior to splitting the data, a response variable that tracks market trends must be calculated. To review the trends of the stock market, the stock’s daily return will be categorized into one of five classes, representing the quintiles of the daily returns.

Further analysis will be conducted on the training set.

# Generating the daily return
for (i in 2:nrow(tesla)){
  tesla$Return[i] <- 
    (tesla$Close[i] - tesla$Close[i-1])/tesla$Close[i-1] 
}

# Assigning a response variable
normalize <- function(x){
  return ((x - min(x)) / (max(x) - min(x)))
}

tesla <- na.omit(tesla)
tesla$Return <- normalize(tesla[,8]) 
tesla$Response <- quant_groups(tesla$Return, groups = 5)
## Observations per group: 479, 478, 479, 478, 479. 0 missing.
summary(tesla$Response)
##     [0,0.398] (0.398,0.432] (0.432,0.457] (0.457,0.493]     (0.493,1] 
##           479           478           479           478           479
# Setting training and test sets
train.set <- tesla %>%
  filter(Date >= as.Date("2010/06/30") & Date <= as.Date("2016/12/31"))

test.set <- tesla %>%
  filter(Date >= as.Date("2017/01/01") & Date <= as.Date("2019/12/31"))


Next, visualizations will be used to observe trends in the data. To obtain the most accurate forecasts, several properties of a time-series dataset must be examined; the following data exploration will assess stationarity, autocorrelation, and seasonality.

The first two visualizations will display the closing price of Tesla stock per day.
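One such daily closing-price plot can be sketched with ggplot2 (assumed; the original plotting code is not shown):

```r
# Line plot of the daily closing price over the full sample period
ggplot(tesla, aes(x = Date, y = Close)) +
  geom_line(colour = "steelblue") +
  labs(title = "Tesla Daily Closing Price, 2010-2019",
       x = "Date", y = "Closing Price (USD)")
```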



Histogram of Closing Price
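A histogram of the training-set closing prices can be sketched as follows (assumed; the original code is not shown):

```r
# Histogram of closing prices to inspect the shape of the distribution
ggplot(train.set, aes(x = Close)) +
  geom_histogram(bins = 30, fill = "steelblue", colour = "white") +
  labs(title = "Histogram of Closing Price",
       x = "Closing Price (USD)", y = "Frequency")
```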

Now that the closing stock prices have been visualized, it is clear that the data is not normally distributed. Further testing for stationarity is shown below.


Testing for Stationarity:

The Augmented Dickey-Fuller test is a statistical test that determines whether the data is stationary. A p-value of less than 0.05 indicates stationary data, i.e. data with a constant mean and constant variance over time.

adf.test(train.set$Close)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  train.set$Close
## Dickey-Fuller = -2.0211, Lag order = 11, p-value = 0.5694
## alternative hypothesis: stationary

Based on the p-value of 0.5694, the Tesla time series is non-stationary. Differencing will be used to correct this. The ndiffs function calculates the number of differences required to make the series stationary, and the differencing itself can be completed with the diff function. The visualizations below will help determine which transformation is best suited to the data, and the differenced data will then undergo the Augmented Dickey-Fuller test again.

ndiffs(train.set$Close, test = "adf")
## [1] 1
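The candidate transformations can be compared visually, for example (a sketch; the original plots are not shown):

```r
# Compare plain differencing against log differencing of the closing price
par(mfrow = c(2, 1))
plot(diff(train.set$Close), type = "l",
     main = "Differenced Closing Price", ylab = "Difference")
plot(diff(log(train.set$Close)), type = "l",
     main = "Log Differenced Closing Price", ylab = "Log difference")
par(mfrow = c(1, 1))
```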

Based on the plots above, the log differenced data appears to be the transformation that best stabilizes the variance of the time series.

adf.test(diff(log(train.set$Close)))
## Warning in adf.test(diff(log(train.set$Close))): p-value smaller than
## printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(log(train.set$Close))
## Dickey-Fuller = -11.137, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary

After a logarithmic transformation and first-order differencing, the time series is now a stationary process.


Autocorrelation:

It is important to determine whether autocorrelation is present within the data. Autocorrelation refers to the degree of linear similarity between the series and a lagged version of itself; in other words, this assessment determines whether past values influence the current observations. The following plots will compare the original and stationary data for the presence of autocorrelation.
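The autocorrelation functions of the original and log differenced series can be plotted side by side (a sketch; the original plotting code is not shown):

```r
# ACF of the raw closing price (left) versus the log differenced series (right)
par(mfrow = c(1, 2))
acf(train.set$Close, main = "ACF: Closing Price")
acf(diff(log(train.set$Close)), main = "ACF: Log Differenced Price")
par(mfrow = c(1, 1))
```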

The above plot (right) shows that the autocorrelation falls to zero quickly, meaning that the data is now stationary. The autocorrelations lie within the dashed blue bounds, indicating that no lag remains significantly correlated with the series. The partial autocorrelation plot below confirms this conclusion.

pacf(diff(log(train.set$Close)), main = " PACF Series: Log Differenced Price")

Since the log differenced closing price results in a stationary dataset, the values will be saved to a new variable to use in the modelling.

train_diff <- train.set %>% 
  select(Date, Open, High, Low, Close, Adjusted) %>% 
  mutate(Logdiff = c(NA, diff(log(Close)))) %>% 
  na.omit() 


Presence of Seasonality:

# 253 is the average number of trading days 
tesla_ts <- xts(train_diff$Logdiff, train_diff$Date)
ts_info(tesla_ts)
##  The tesla_ts series is a xts object with 1 variable and 1638 observations
##  Frequency: daily 
##  Start time: 2010-07-01 
##  End time: 2016-12-30
temp <- ts(tesla_ts,start=c(2010,1),freq=253)

ts_decompose(temp)


Normalization:

train_norm <- normalize(train.set[,c(2:5)])
train_norm$Response <- train.set$Response

Technical Indicators

The ARIMA model can be strengthened with additional technical indicators calculated from the price series: the moving average convergence/divergence (MACD), relative strength index (RSI), price rate of change (ROC), simple moving average (SMA), stochastic oscillator, and Williams %R.

# The TTR indicator functions are vectorized, so no loop is required
# MACD: moving average convergence/divergence
train_diff$MACD <- MACD(Cl(train_diff), nFast = 12, nSlow = 26, nSig = 9)
# RSI: relative strength index
train_diff$RSI <- RSI(Cl(train_diff), n = 14)
# Price rate of change
train_diff$ROC <- ROC(Cl(train_diff), n = 14) * 100
# Simple moving average
train_diff$SMA <- SMA(Cl(train_diff), n = 14)

# High-low-close frame for the indicators that need all three series
hlac <- data.frame(x = Hi(train_diff), y = Lo(train_diff), z = Cl(train_diff))

# Stochastic oscillator
train_diff$STO <- stoch(hlac, nFastK = 14) * 100
# Williams %R
train_diff$WPR <- WPR(hlac, n = 14) * (-100)

Modelling

ARIMA model:

Prior to determining how accurate the ARIMA model is with the addition of technical indicators, a base model will be built from the log differenced closing prices. This model will provide a point of comparison for the second ARIMA model.

Base Model

set.seed(10)
arima_model <- auto.arima(train_diff$Logdiff, ic = c("aicc", "aic", "bic"))
summary(arima_model)
## Series: train_diff$Logdiff 
## ARIMA(5,0,2) with non-zero mean 
## 
## Coefficients:
##           ar1      ar2      ar3      ar4     ar5     ma1     ma2    mean
##       -0.6993  -0.7996  -0.0250  -0.0246  0.0258  0.7207  0.7795  0.0013
## s.e.   0.3046   0.3899   0.0409   0.0355  0.0413  0.3042  0.3855  0.0008
## 
## sigma^2 estimated as 0.001091:  log likelihood=3266.22
## AIC=-6514.43   AICc=-6514.32   BIC=-6465.82
## 
## Training set error measures:
##                         ME       RMSE        MAE MPE MAPE      MASE
## Training set -1.697138e-06 0.03294306 0.02273582 NaN  Inf 0.6923193
##                     ACF1
## Training set 0.000893795
checkresiduals(arima_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(5,0,2) with non-zero mean
## Q* = 3.6857, df = 3, p-value = 0.2975
## 
## Model df: 8.   Total lags used: 11
forecast1 <- forecast(arima_model)
accuracy(forecast1, test = test.set$Close)
##                       ME       RMSE        MAE MPE MAPE      MASE
## Training set 0.001128622 0.02747572 0.02226463 NaN  Inf 0.9951315
##                   ACF1
## Training set 0.2217782

Revised Model

arima_model2 <- auto.arima(train_diff$Logdiff, ic = c("aicc", "aic", "bic"), xreg = train_diff$ROC)
summary(arima_model2)
## Series: train_diff$Logdiff 
## Regression with ARIMA(0,0,0) errors 
## 
## Coefficients:
##        xreg
##       7e-04
## s.e.  1e-04
## 
## sigma^2 estimated as 0.0009607:  log likelihood=3330.79
## AIC=-6657.59   AICc=-6657.58   BIC=-6646.8
## 
## Training set error measures:
##                        ME       RMSE        MAE MPE MAPE      MASE
## Training set -8.75617e-05 0.03111951 0.02194144 NaN  Inf 0.6681297
##                      ACF1
## Training set -0.009184689
checkresiduals(arima_model2)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(0,0,0) errors
## Q* = 28.084, df = 9, p-value = 0.0009234
## 
## Model df: 1.   Total lags used: 10
forecast2 <- forecast(arima_model2, xreg = train_diff$ROC)
## Warning in forecast.forecast_ARIMA(arima_model2, xreg = train_diff$ROC):
## Upper prediction intervals are not finite.
accuracy(forecast2, test = test.set$Close)
##                        ME       RMSE        MAE MPE MAPE      MASE
## Training set 0.0007673625 0.02713484 0.02206183 NaN  Inf 0.9860671
##                   ACF1
## Training set 0.2257881

K-Nearest Neighbours:

set.seed(1)
knn_model <- train(Response ~ High + Low + Open + Close, 
                   data = train_norm, 
                   method = "knn")

predictions <- predict(knn_model, newdata = test.set)
confusionMatrix(predictions, test.set$Response)
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      [0,0.398] (0.398,0.432] (0.432,0.457] (0.457,0.493]
##   [0,0.398]             0             0             0             0
##   (0.398,0.432]         1             2             2             1
##   (0.432,0.457]        70            78            79            73
##   (0.457,0.493]        33            28            36            39
##   (0.493,1]            49            40            33            46
##                Reference
## Prediction      (0.493,1]
##   [0,0.398]             0
##   (0.398,0.432]         2
##   (0.432,0.457]        70
##   (0.457,0.493]        34
##   (0.493,1]            38
## 
## Overall Statistics
##                                          
##                Accuracy : 0.2095         
##                  95% CI : (0.181, 0.2404)
##     No Information Rate : 0.2109         
##     P-Value [Acc > NIR] : 0.5499         
##                                          
##                   Kappa : 0.0126         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
## 
## Statistics by Class:
## 
##                      Class: [0,0.398] Class: (0.398,0.432]
## Sensitivity                    0.0000             0.013514
## Specificity                    1.0000             0.990099
## Pos Pred Value                    NaN             0.250000
## Neg Pred Value                 0.7971             0.804290
## Prevalence                     0.2029             0.196286
## Detection Rate                 0.0000             0.002653
## Detection Prevalence           0.0000             0.010610
## Balanced Accuracy              0.5000             0.501806
##                      Class: (0.432,0.457] Class: (0.457,0.493]
## Sensitivity                        0.5267              0.24528
## Specificity                        0.5182              0.77983
## Pos Pred Value                     0.2135              0.22941
## Neg Pred Value                     0.8151              0.79452
## Prevalence                         0.1989              0.21088
## Detection Rate                     0.1048              0.05172
## Detection Prevalence               0.4907              0.22546
## Balanced Accuracy                  0.5224              0.51256
##                      Class: (0.493,1]
## Sensitivity                    0.2639
## Specificity                    0.7246
## Pos Pred Value                 0.1845
## Neg Pred Value                 0.8066
## Prevalence                     0.1910
## Detection Rate                 0.0504
## Detection Prevalence           0.2732
## Balanced Accuracy              0.4942

Feature Selection

Feature selection will be used to find the best model for the classification algorithm, determining which attribute(s) best predict the response of the next day’s market. The ARIMA model uses the Akaike information criterion (AIC) to select the best-fitting model. Two other feature selection methods, one filter-based and one wrapper-based, are shown below.

# Filter-based Feature Selection Technique
# Correlation Based
cfs(Response ~ High + Low + Open + Close, data = train_norm)
## [1] "High"
# Wrapper-based Feature Selection Technique
#rfe(x = train_norm[, 1:4], y = train_norm[, 5], rfeControl = control, metric = "ROC")

Performance Evaluation